Motif Matching Using Gapped Patterns
Abstract
We present new algorithms for the problem of multiple string matching of gapped patterns, where a gapped pattern is a sequence of strings such that there is a gap of fixed length between each two consecutive strings. The problem has applications in the discovery of transcription factor binding sites in DNA sequences when using generalized versions of the Position Weight Matrix model to describe transcription factor specificities. We present a simple practical algorithm, based on bit-parallelism, that, given a text of length $n$, runs in time $O(n(\log \sigma + g_{w\text{-span}} \lceil k\text{-len}(P)/w \rceil) + occ)$, where $occ$ is the number of occurrences of the patterns in the text, $k\text{-len}(P)$ is the total number of strings in the patterns and $1 \le g_{w\text{-span}} \le w$ is the maximum number of distinct gap lengths that span a single word of $w$ bits in our encoding. We then show how to improve the time complexity to $O(n(\log \sigma + \log g_{size}(P) \lceil k\text{-len}(P)/w \rceil) + occ)$ in the worst-case, where $g_{size}(P)$ is the size of the variation range of the gap lengths. Finally, by parallelizing in a different order we obtain $O(\lceil n/w \rceil\, len(P) + n + occ)$ time, where $len(P)$ is the total number of alphabet symbols in the patterns. We also provide experimental results which show that the presented algorithms are fast in practice, and preferable if all the strings in the patterns have unit-length.
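To make the problem statement concrete, the following is a minimal sketch of matching a single gapped pattern with bit-parallelism. It is not the algorithm of the paper: it uses the textbook Shift-And technique with don't-care positions, treating every gap position as a wildcard, and it handles one pattern rather than a set. The function name `find_gapped` and its parameters `strings` and `gaps` are hypothetical, chosen for illustration only. Python integers stand in for machine words, so the state vector is not limited to w bits here.

```python
def find_gapped(text, strings, gaps):
    """Return the 0-based end positions of all occurrences of a gapped pattern.

    strings: the non-empty strings S1..Sk of the pattern (hypothetical input format).
    gaps:    the k-1 fixed gap lengths between consecutive strings.
    """
    if not strings or any(len(s) == 0 for s in strings):
        raise ValueError("the pattern needs at least one non-empty string")
    if len(gaps) != len(strings) - 1:
        raise ValueError("expected exactly one gap between consecutive strings")

    # Expand the gapped pattern into one symbol sequence; None marks a gap position.
    expanded = []
    for idx, s in enumerate(strings):
        expanded.extend(s)
        if idx < len(gaps):
            expanded.extend([None] * gaps[idx])

    # Bit j of masks[c] is set when position j of the expanded pattern equals c.
    # Bit j of wildcard is set when position j lies inside a gap (matches anything).
    masks = {}
    wildcard = 0
    for pos, sym in enumerate(expanded):
        if sym is None:
            wildcard |= 1 << pos
        else:
            masks[sym] = masks.get(sym, 0) | (1 << pos)

    accept = 1 << (len(expanded) - 1)  # bit of the last pattern position
    state = 0
    ends = []
    for i, c in enumerate(text):
        # Shift-And step: extend every active prefix by one symbol and start a new one.
        state = ((state << 1) | 1) & (masks.get(c, 0) | wildcard)
        if state & accept:
            ends.append(i)
    return ends


# The pattern "AC", gap of length 1, "T" occurs once, ending at index 3.
print(find_gapped("ACGTTACGAAT", ["AC", "T"], [1]))
```

Each text symbol costs a constant number of word operations per machine word of the expanded pattern, which is where this baseline differs from the bounds in the abstract: those depend on the number of strings and on the gap structure, not on the total expanded length including gaps.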
Similar resources
Indexing Gapped-Factors Using a Tree
We present a data structure to index a specific kind of factors, that is of substrings, called gapped-factors. A gapped-factor is a factor containing a gap that is ignored during the indexation. The data structure presented is based on the suffix tree and indexes all the gapped-factors of a text with a fixed size of gap, and only those. The construction of this data structure is done online in ...
The gapped-factor tree
We present a data structure to index a specific kind of factors, that is of substrings, called gapped-factors. A gapped-factor is a factor containing a gap that is ignored during the indexation. The data structure presented is based on the suffix tree and indexes all the gapped-factors of a text with a fixed size of gap, and only those. The construction of this data structure is done online in ...
Discovering sequence motifs of different patterns parallel using DNA operations
Discovery of motifs in biological sequences and various types of subsequences in commercial databases has varied applications and interpretations. This paper proposes a new approach to solve the Combinatorial Pattern Matching (CPM), search for continuous and gapped rigid subsequences and discover Longest Common Rigid Subsequences (LCRS) from the given sequences using DNA operations and modifie...
On the complexity of finding gapped motifs
A gapped pattern is a sequence consisting of regular alphabet symbols and of joker symbols that match any alphabet symbol. The content of a gapped pattern is defined as the number of its non-joker symbols. A gapped motif is a gapped pattern that occurs repeatedly in a string or in a set of strings. The aim of this paper is to study the complexity of several gapped motif finding problems. The fo...
Efficient Learning of Semi-structured Data from Queries
This paper studies the learning complexity of classes of structured patterns for HTML/XML-trees in the query learning framework of Angluin. We present polynomial time learning algorithms for ordered gapped tree patterns, OGT, and ordered gapped forests, OGF, under the into-matching semantics using equivalence queries and subset queries. As a corollary, the learnability with equivalence and mem...
Journal title:
- Theor. Comput. Sci.
Volume 548
Pages -
Publication date 2013